Setonix is a world-class supercomputer, delivering over 27 Petaflops of floating point performance using AMD EPYC CPUs and Instinct MI250X GPUs. Currently Setonix sits at number 15 on the TOP500 list of the world's most powerful computers.
The Pawsey Documentation Portal should be your first port of call when looking for documentation. That source takes priority if there is any discrepancy between the official documentation and this material. On this page is some documentation specific to using GPUs on Setonix.
Firstly, you need a username and password to access Setonix. Your username and password will be given to you prior to the beginning of this workshop. If you are using your regular Pawsey account then you can reset your password here.
Access to Setonix is via Secure Shell (SSH). On Linux, macOS, and Windows 10 and higher, an SSH client is available from the command line or terminal application. Otherwise you need to use a client program like PuTTY or MobaXterm.
On the command line use ssh to access Setonix.
ssh -Y <username>@setonix.pawsey.org.au
In order to avoid typing your password on each login you can generate a keypair on your computer, like this:
ssh-keygen -t rsa
Then copy the public key (the file that ends in *.pub) to your account on Setonix and append it to the authorized_keys file in ${HOME}/.ssh. On your machine run this command:
scp <filename>.pub <username>@setonix.pawsey.org.au:
Then log in to Setonix and run these commands:
mkdir -p ${HOME}/.ssh
cat <filename>.pub >> ${HOME}/.ssh/authorized_keys
chmod 700 ${HOME}/.ssh
chmod 600 ${HOME}/.ssh/authorized_keys
Then you can run
ssh <username>@setonix.pawsey.org.au
without a password.
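Optionally, a host entry in the SSH client configuration file on your own machine shortens the login command. Below is a minimal sketch of ${HOME}/.ssh/config; replace <username> with your actual username, and adjust the options to taste.

```
# Hypothetical entry for Setonix; ForwardX11/ForwardX11Trusted mirror ssh -Y,
# and ServerAliveInterval keeps idle connections open.
Host setonix
    HostName setonix.pawsey.org.au
    User <username>
    ForwardX11 yes
    ForwardX11Trusted yes
    ServerAliveInterval 60
```

With this in place, `ssh setonix` is equivalent to the full command above.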
If you have an OS that is older than Windows 10 and need a client in a hurry, then just download MobaXterm Home (Portable Edition) from this location. Extract the Zip file and run the application. You might need to accept a firewall notification.
Now go to Settings -> SSH and uncheck "Enable graphical SSH-browser" in the SSH-browser settings pane. Also enable "SSH keepalive" to keep SSH connections active.
Close the MobaXterm settings and start a local terminal.
On Setonix there are two main kinds of compute nodes:
CPU nodes are based on the AMD™ EPYC™ 7763 processor in a dual-socket configuration. Each processor has a multi-chip design with 8 chiplets (Core Complex Dies). Shown below is a near-infrared image of an EPYC processor, showing the 8 chiplets and an IO die.
Each chiplet has 8 cores, and these cores share access to a 32 MB L3 cache. Every core has its own L1 and L2 cache, provides 2 hardware threads, and has access to SIMD units that can perform floating point math on vectors up to 256 bits (8x 32-bit floats) wide in a single clock cycle. There are 16 hardware threads available per chiplet. Since every processor has 8 chiplets, there are a total of 64 cores / 128 threads per processor, and 128 cores / 256 threads per node. Here is some cache and performance information for the AMD EPYC 7763 CPU.
| Node | CPU | Base clock freq (GHz) | Peak clock freq (GHz) | Cores | Hardware threads | L1 Cache (KB) | L2 Cache (KB) | L3 Cache (MB) | FP SIMD width (bits) | Peak TFLOPs (FP32) |
|---|---|---|---|---|---|---|---|---|---|---|
| CPU | AMD EPYC 7763 | 2.45 | 3.50 | 64 | 128 | 64x32 | 64x512 | 8x32 | 256 | ~1.79 |
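As a sanity check, the peak FP32 figure in the table follows from the other columns under a simple model in which each core retires one 256-bit SIMD vector of 32-bit floats per cycle at the peak clock (counting an FMA as two operations per cycle would double this; the table evidently uses the simpler count). A quick sketch of the arithmetic:

```python
cores = 64
peak_clock_hz = 3.50e9
fp32_lanes = 256 // 32   # 8 single-precision lanes per 256-bit SIMD unit

# One vector result per core per cycle at the peak clock
peak_tflops = cores * peak_clock_hz * fp32_lanes / 1e12
print(f"{peak_tflops:.2f} TFLOP/s")  # ~1.79, matching the table
```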
Below is an image of a CPU compute blade on Setonix; in this shot there are 8 CPU heatsinks, for a total of four nodes per blade.
GPU nodes on Setonix have one AMD 7A53 'Trento' CPU processor and four MI250X GPU processors. The CPU is a specially optimized version of the EPYC processor used in the CPU nodes, but otherwise has the same design and architecture. The Instinct™ MI250X processor is also a Multi-Chip Module (MCM) design, with two Graphics Compute Dies (GCDs) per processor, as shown below.
Each of the two Graphics Compute Dies (GCDs) in an MI250X appears to HIP as an individual compute device with its own 64 GB of global memory and 8 MB of L2 cache. Since there are four MI250Xs, a total of 8 GPU compute devices are visible to HIP per GPU node. Each compute device has 110 compute units, and each compute unit executes instructions over a bank of 4x16 floating point SIMD units that share a 16 KB L1 cache, as seen below:
The interesting thing to note about these compute units is that both 64-bit and 32-bit floating point instructions are executed natively at the same rate. Therefore only the increased bandwidth requirements for moving 64-bit numbers around are a performance consideration. Below is a table of performance numbers for each of the four dual-GCD MI250X processors in a GPU node.
| Card | Boost clock (GHz) | Compute Units | FP32 Processing Elements | FP64 Processing Elements (equivalent compute capacity) | L1 Cache (KB) | L2 Cache (MB) | Device memory (GB) | Peak TFLOPs (FP32) | Peak TFLOPs (FP64) |
|---|---|---|---|---|---|---|---|---|---|
| AMD Instinct MI250X | 1.7 | 2x110 | 2x7040 | 2x7040 | 2x110x16 | 2x8 | 2x64 | 47.9 | 47.9 |
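Again the peak figure can be reproduced from the other columns: two GCDs, each with 110 compute units of 64 processing elements, where each element performs a fused multiply-add (counted as two FLOPs) per cycle at the boost clock. A quick sketch of the arithmetic:

```python
gcds = 2
compute_units = 110       # per GCD
pe_per_cu = 64            # 4 SIMD units x 16 lanes per compute unit
boost_clock_hz = 1.7e9
flops_per_cycle = 2       # a fused multiply-add counts as two FLOPs

peak_tflops = gcds * compute_units * pe_per_cu * flops_per_cycle * boost_clock_hz / 1e12
print(f"{peak_tflops:.1f} TFLOP/s")  # ~47.9, matching the table
```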
Below is an installation image of a GPU compute blade with two nodes. Each node has one CPU socket and four GPU sockets.
On Setonix the following queues are available for use. A special account is needed to access the gpu queue; this will usually be your project name followed by the suffix -gpu.
| Queue | Max time limit | Processing elements (CPU) | Socket | Cores | processing elements per CPU core | Available memory (GB) | Number of HIP devices | Memory per HIP device (GB) |
|---|---|---|---|---|---|---|---|---|
| work | 24 hours | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| long | 96 hours | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| debug | 1 hour | 256 | 2 | 64 | 2 | 230 | 0 | 0 |
| highmem | 24 hours | 256 | 2 | 64 | 2 | 980 | 0 | 0 |
| copy | 24 hours | 32 | 1 | 64 | 2 | 118 | 0 | 0 |
| gpu | 24 hours | 128 | 1 | 64 | 2 | 230 | 4x2 | 64 |
When compiling software or running test jobs on Setonix it is sometimes helpful to have interactive access to a GPU node. Access to the gpu partition requires a separate allocation. The following command reserves one GPU (a single GCD) and 8 cores for interactive use. You can use this to compile software and run interactive jobs on a GPU node of Setonix, but for the workshop you might need to use the salloc command in the welcome letter.
salloc --account=${PAWSEY_PROJECT}-gpu --ntasks=1 --mem=8GB --cpus-per-task=8 --time=4:00:00 --gpus-per-task=1 --partition=gpu
The main complexity with building HIP-enabled applications on Setonix arises when you also need support for MPI. Otherwise you can simply load the rocm module and use hipcc. Here are some suggested workflows if you need MPI support.
There are three main programming environments available on Setonix. Each provides C/C++ and Fortran compilers that build software with knowledge of the MPI libraries available on Setonix. The PrgEnv-gnu programming environment loads the GNU compilers for best software compatibility, the module PrgEnv-aocc loads the AMD AOCC optimising compiler to try to get the best performance from the AMD CPUs on Setonix, and the PrgEnv-cray environment loads the well-supported compilers from Cray. Use these commands to find which module to load.
| Programming environment | command to use |
|---|---|
| AMD | module avail PrgEnv-aocc |
| Cray | module avail PrgEnv-cray |
| GNU | module avail PrgEnv-gnu |
When compiling C/C++ HIP sources you have the choice of either the ROCm hipcc compiler wrapper or the Cray compiler wrapper CC from PrgEnv-cray. If you use the Cray compiler wrapper you need to swap to the module PrgEnv-cray, as the GNU programming environment (PrgEnv-gnu) is loaded by default.
module swap PrgEnv-gnu PrgEnv-cray
Then the following compiler wrappers are available for use to compile source files:
| Command | Explanation |
|---|---|
| cc | C compiler |
| CC | C++ compiler |
| ftn | Fortran compiler |
In order to use a GPU-aware MPI library from Cray you also need to load the craype-accel-amd-gfx90a module, which is available in all three programming environments. Load the module with this command.
module load craype-accel-amd-gfx90a
Then set this environment variable to enable GPU support with MPI.
export MPICH_GPU_SUPPORT_ENABLED=1
Finally, in order to have ROCm software (such as hipcc and rocgdb) and libraries available you need to load the rocm/5.0.2 module.
module load rocm/5.0.2
The default rocm module (version 5.0.2) is independent of the programming environment. The module rocm/5.4.3 is currently an experimental trial of the latest ROCm software stack; however, it needs the PrgEnv-gnu module loaded in order to work, and some functionality (such as debugging) appears to be contingent on a future GPU driver update.
Omnitrace is a tool that uses rocprof to collect traces: information on when an application component starts using compute resources, and for how long it uses those resources. Currently you will need these modules loaded to access the experimental Omnitrace tools.
module load rocm/5.0.2
module use /software/projects/courses01/setonix/omnitrace/share/modulefiles
module load omnitrace/1.10.0
Omniperf is a tool that makes low-level information collected by rocprof accessible. It can, for example, create roofline models of how well your kernels are performing in relation to the theoretical capability of the compute hardware. The following commands will help you access the experimental Omniperf tools.
module load cray-python
module load rocm/5.0.2
module use /software/projects/courses01/setonix/omniperf/1.0.8PR2/modulefiles
module load omniperf/1.0.8-PR2
According to this documentation, the AMD compiler wrapper hipcc can be used for compiling HIP source files and is the suggested linker for program objects. The Cray C++ compiler can also compile HIP source code if the compiler option -x hip is added to CC, but you need to have the PrgEnv-cray environment loaded for this to work.
To give yourself the best chance of avoiding compiler issues, it is best practice to compile from a GPU node, either from a batch or an interactive job. If you use hipcc to compile HIP source, then you can use another compiler to compile other sources and then use hipcc to link them.
You can use these compiler flags with hipcc to bring in the MPI headers and make sure hipcc compiles kernels for the MI250X architecture on Setonix. These flags work with hipcc in all of the programming environments.
| Function | flags |
|---|---|
| Compile | -I${MPICH_DIR}/include --offload-arch=gfx90a |
| Link | -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a} |
| Debug (compile and link) | -O0 -g -ggdb |
| OpenMP (compile and link) | -fopenmp |
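As a sketch, the flags in the table above might be collected into a Makefile along the following lines. The source and target names (app.cpp, app.exe) are illustrative; the environment variables are those provided by the Cray programming environment as described above.

```makefile
# Hypothetical Makefile fragment using the hipcc flags from the table.
CXX      = hipcc
CXXFLAGS = -I$(MPICH_DIR)/include --offload-arch=gfx90a -fopenmp
LDFLAGS  = -L$(MPICH_DIR)/lib -lmpi \
           $(PE_MPICH_GTL_DIR_amd_gfx90a) $(PE_MPICH_GTL_LIBS_amd_gfx90a) \
           -fopenmp

app.exe: app.o
	$(CXX) $^ $(LDFLAGS) -o $@

app.o: app.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@
```

Separating compile and link steps like this makes it easy to mix hipcc-compiled objects with objects built by other compilers, as discussed above.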
If you want hipcc to behave like the compiler wrapper CC from your chosen programming environment then make sure the craype-accel-amd-gfx90a module is also loaded. Then add the output of this command,
$(CC --cray-print-opts=cflags)
to the hipcc compile flags, and the output of this command,
$(CC --cray-print-opts=libs)
to the hipcc linker flags.
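Putting those two commands together, a build file could capture the wrapper's flags at build time. This is a minimal sketch under the assumption of a single illustrative source file (mpi_app.cpp); it requires the relevant PrgEnv and craype-accel-amd-gfx90a modules to be loaded when make runs.

```makefile
# Hypothetical fragment: splice the Cray wrapper's flags into a hipcc build.
CRAY_CFLAGS := $(shell CC --cray-print-opts=cflags)
CRAY_LIBS   := $(shell CC --cray-print-opts=libs)

mpi_app.exe: mpi_app.cpp
	hipcc $(CRAY_CFLAGS) --offload-arch=gfx90a $< $(CRAY_LIBS) -o $@
```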
If you are using the C++ compiler wrapper CC from the PrgEnv-cray environment, you can add these flags to compile and link HIP code for the MI250X GPUs on Setonix.
| Function | flags |
|---|---|
| Compile | -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -x hip |
| Link | |
| Debug (compile and link) | -g |
| OpenMP (compile and link) | -fopenmp |
This page from the HPE documentation explains what the compiler options are for. The option -D__HIP_ROCclr__ is necessary to use the ROCm Common Language Runtime interface, and the flags -D__HIP_ARCH_GFX90A__=1 and --offload-arch=gfx90a enable specific settings and device code for the gfx90a architecture in the MI250X GPUs. The flag -x hip informs CC that the file is HIP source.
As this documentation notes, whenever you mix compilers it is important to ensure that all code links to the same C++ standard libraries. The command hipconfig --cxx generates extra compile flags that might be useful to include in the build process with the Cray wrappers.
The files hello_devices_mpi.cpp and hello_devices_mpi_onefile.cpp implement an MPI-enabled HIP application that reports on devices and fills a vector. The difference between the two is that hello_devices_mpi.cpp has the kernel located in a separate file, kernels.hip.cpp. Your task is to compile these files into two executables, hello_devices_mpi.exe and hello_devices_mpi_onefile.exe.
ssh <username>@setonix.pawsey.org.au
cd $MYSCRATCH
wget https://github.com/pelagos-consulting/HIP_Course/archive/refs/heads/main.zip
unzip -DD main.zip
cd HIP_Course-main/course_material/L2_Using_HIP_On_Setonix
salloc --account ${PAWSEY_PROJECT}-gpu --ntasks 1 --mem 8GB --cpus-per-task 8 --time 1:00:00 --gpus-per-task 1 --partition gpu
module swap PrgEnv-gnu PrgEnv-cray
module load rocm/5.0.2
module load craype-accel-amd-gfx90a
Use hipcc to compile the kernel sources into kernels.o for later linking.
hipcc -c kernels.hip.cpp --offload-arch=gfx90a -o kernels.o
CC -c -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -x hip hello_devices_mpi.cpp -o hello_devices_mpi.o
hipcc kernels.o hello_devices_mpi.o -o hello_devices_mpi.exe -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}
This should work with any programming environment.
hipcc -I${MPICH_DIR}/include --offload-arch=gfx90a hello_devices_mpi_onefile.cpp -o hello_devices_mpi_onefile_hipcc.exe -L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}
This only works with the PrgEnv-cray environment.
CC -D__HIP_ROCclr__ -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -x hip hello_devices_mpi_onefile.cpp -o hello_devices_mpi_onefile_CC.exe
If you are in an interactive or batch job then the proper number of compute devices should appear when you run these commands.
./hello_devices_mpi.exe
./hello_devices_mpi_onefile_hipcc.exe
./hello_devices_mpi_onefile_CC.exe
Try changing the number of GPUs in your resource request for the interactive job. How many GPUs appear in the output from the commands above?
If you get stuck, the example Makefile contains the above compilation steps. Assuming you have loaded the right modules as described above, run make as follows:
make clean; make
The script run_compile.sh contains the necessary commands to load the appropriate modules and run the make command.
chmod 700 run_compile.sh
./run_compile.sh
Pawsey has extensive documentation available for running jobs, at this site. Here is some information that is specific to making best use of the GPU nodes on Setonix.
On the GPU nodes of Setonix there is one CPU and there are 8 GPU compute devices. Each of the 8 chiplets in the CPU is intended to have optimal access to one of the 8 available compute devices. Shown below is a hardware diagram of a compute node, where each chiplet is connected optimally to one compute device.
From the above diagram we see that the best use of the GPUs occurs when a chiplet accesses the GPU that is closest to it. Work is still being done on making sure that MPI processes map optimally to the available compute devices; however, these interim suggestions will help space out the MPI tasks so each task resides on its own chiplet.
The suggested job script below will allocate an MPI task for every compute device on a node of Setonix, and then allocate 8 OpenMP threads to each MPI task. We can use the helper program hello_jobstep.cpp, adapted from a program by Thomas Papatheodore of ORNL. Every software thread executed by the program reports the MPI rank, the OpenMP thread, the CPU hardware thread, and the GPU and bus IDs of the GPU hardware.
#!/bin/bash -l
#SBATCH --account=<account>-gpu # your account
#SBATCH --partition=gpu # Using the gpu partition
#SBATCH --ntasks=8 # Total number of tasks
#SBATCH --ntasks-per-node=8 # Set this for 1 mpi task per compute device
#SBATCH --cpus-per-task=8 # How many OpenMP threads per MPI task
#SBATCH --threads-per-core=1 # How many OpenMP threads per core
#SBATCH --gpus-per-task=1 # How many HIP compute devices to allocate to a task
#SBATCH --gpu-bind=closest # Bind each MPI task to the nearest GPU
#SBATCH --mem=4000M #Indicate the amount of memory per node when asking for shared resources
#SBATCH --exclusive # Use this to request all the resources on a node
#SBATCH --time=00:05:00
module swap PrgEnv-gnu PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm
export MPICH_GPU_SUPPORT_ENABLED=1 # Enable GPU support with MPI
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK #To define the number of OpenMP threads available per MPI task, in this case it will be 8
export OMP_PLACES=cores #To bind to cores
export OMP_PROC_BIND=close # Bind threads to the places defined above, keeping them as close together as possible
# Temporary workaround for avoiding Slingshot issues on shared nodes:
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
# Compile the software
make clean
make
# Run a job with task placement and $BIND_OPTIONS
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./hello_jobstep.exe | sort
The file jobscript.sh is a batch script implementing the settings above. Edit the <account> field to include the account to charge to. The value to use will be in the environment variable $PAWSEY_PROJECT.
echo $PAWSEY_PROJECT
Then submit the script to the batch queue with this command
sbatch jobscript.sh
Use this command to check on the progress of your job
squeue --me
Then, if you need to and you know the job ID, you can cancel a job with this command
scancel <jobID>
Once the job is done, have a look at the *.out file and examine how the threads and GPUs are placed.
In this section we cover using HIP on the Pawsey Supercomputer Setonix. This includes logins with SSH; hardware and software environments; and accessing the job queues through interactive and batch jobs. We conclude the chapter with the HIP software compilation process on Setonix, and then how to get the best performance in batch jobs by scheduling MPI tasks close to the available compute devices.